Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloud_storage: improve error handling #6524

Merged
merged 2 commits into from
Sep 29, 2022

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Sep 26, 2022

Cover letter

We have recently seen two error codes from GnuTLS when running against AWS S3:
"error GnuTLS:-110, The TLS connection was non-properly terminated."
"error GnuTLS:-53, Error in the push function."

Examples:

Handle these by adding EBADR for the 53 case, and an abs() around the code to handle negative error codes in general: the 110 code (ETIMEDOUT) was already handled but only when positive.

Backport Required

  • not a bug fix
  • issue does not exist in previous branches
  • papercut/not impactful enough to backport
  • v22.2.x
  • v22.1.x
  • v21.11.x

UX changes

None

Release notes

Improvements

  • Logging verbosity is reduced when S3 backends unexpectedly close connections

@jcsp jcsp added kind/bug Something isn't working area/cloud-storage Shadow indexing subsystem and removed area/redpanda labels Sep 26, 2022
@jcsp jcsp marked this pull request as ready for review September 26, 2022 16:13
VladLazar
VladLazar previously approved these changes Sep 26, 2022
Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Comment on lines 110 to 112
if (auto code = abs(cerr.code().value());
code != ECONNREFUSED && code != ENETUNREACH && code != ETIMEDOUT
&& code != ECONNRESET && code != EPIPE && code != EBADR) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm confused by the commit message. how is this error:

error GnuTLS:-110, The TLS connection was non-properly terminated.

related to handling for standard errno.h codes like ETIMEDOUT? i was under the impression that gnutls error codes were unrelated to standard error codes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a function instead? (so expanding it is just adding one element to the array + we can add a unit test.

static_array<int> codes {{ refused, reset, unreach, timedout, badaddr, pipe }}

for (c in code) {
if code == c; return true
}
return false

kinda thing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, is nice if it's a func in case at some point we do openssl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you're right. Good catch.

I was assuming that something was munging gnutls errors into vaguely-semantically-equivalent posix errnos, but in fact they're totally separate.

I guess I need to revise this to do a disgusting string comparison on the system_error's message to identify gnutls cases, and then compare them with official gnutls error codes from their header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's not. I assumed gnutls maps it's error codes to linux error codes, but uses the negative for some reason. That doesn't seem to be the case. I guess we should silence specific gnutls errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcsp perhaps you don't need to parse the string. if these gnutls-errors-as-std::system_error are originating from seastar, then I see that seastar is already attaching a static const gnutls_error_category glts_errorc; so we can identify which are errno.h and which are gnutls.

These are a proxy for underlying connection issues, e.g.
- "error GnuTLS:-110, The TLS connection was non-properly terminated."
- "error GnuTLS:-53, Error in the push function."

Treat them as retryable the same way we do other transient
network errors when communicating with an S3 endpoint.
Comment on lines +59 to +62
// The name() of seastar's gnutls_error_category class
constexpr std::string_view gnutls_cateogry_name{"GnuTLS"};

if (e.code().category().name() == gnutls_cateogry_name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder why seastar didn't put this in a header so we could check the category without string comparison 😢

@jcsp
Copy link
Contributor Author

jcsp commented Sep 27, 2022

This passed tests here: https://buildkite.com/redpanda/redpanda/builds/15766

The release-clang-amd64 check looks incomplete because the job in buildkite was renamed to test-clang-amd64

@jcsp jcsp merged commit a209299 into redpanda-data:dev Sep 29, 2022
@jcsp jcsp deleted the cloud-storage-log-errors branch September 29, 2022 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cloud-storage Shadow indexing subsystem area/redpanda kind/bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants